Fetch Tweets by User

NOTE: You must have Python Twitter Tools installed on the machine to run this notebook. You can install it by running the cell below (change the cell type in the toolbar above from Raw NBConvert to Code). You may need to use "! sudo easy_install twitter" if you hit a permissions error.

! easy_install twitter
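
If easy_install gives you trouble, the same package should also install via pip (Python Twitter Tools is published on PyPI under the name "twitter"):

! pip install twitter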

Jupyter Notebook Style

Let's make this thing look nice.


In [10]:
from IPython.core.display import HTML
styles = open("../css/custom.css", "r").read() # Load the notebook's custom stylesheet
HTML(styles) # Render it so the styles take effect


Out[10]:

First we import the libraries we'll be using.


In [1]:
from twitter import * # Python Twitter Tools: Twitter, OAuth, TwitterHTTPError, etc.
import csv, json
import cPickle as pickle # Python 2's C-accelerated pickle

Set Up Twitter Access

You can find the OAuth credentials needed below in Twitter's application manager. Create a new app if you haven't already. Once the app has been created, you'll find the necessary information under "Keys and Access Tokens". You may need to generate the access tokens yourself.


In [15]:
# Twitter OAuth Credentials
consumer_key = '' # Consumer Key (API Key)
consumer_secret = '' # Consumer Secret (API Secret)
access_token = '' # Access Token
access_secret = '' # Access Token Secret

In [3]:
t = Twitter(auth=OAuth(access_token, access_secret, consumer_key, consumer_secret))
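
As a quick sanity check, you can ask Twitter who you're authenticated as. This is a minimal sketch; in Python Twitter Tools' attribute-based API, account.verify_credentials maps to the GET account/verify_credentials endpoint.

me = t.account.verify_credentials()
print('Authenticated as @%s' % me['screen_name'])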

Set Up Local Paths

Paths on your machine to the files you'd like to save the tweets to.

  • Example pickle, Mac: /Users/[username]/Documents/twitter-analysis/data/raw/tweets.p

In [24]:
jsonpath = '' # Path to the JSON file where retrieved tweets go
picklepath = '' # Path to the pickle file where retrieved tweets go
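
For example, following the layout above (hypothetical filenames under a twitter-analysis project folder in your home directory):

import os
base = os.path.expanduser('~/Documents/twitter-analysis/data/raw')
jsonpath = os.path.join(base, 'tweets.json') # Hypothetical filename
picklepath = os.path.join(base, 'tweets.p')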

Whose Tweets?

Whose tweets will we be fetching? Write a list of usernames (also called screen names or Twitter handles).


In [11]:
usernames = ['UNICEFIndia','satyamevjayate','aamir_khan'] # Example: ['UNICEFIndia','satyamevjayate','UNICEF']

Handle HTTP Errors

This is from Matthew A. Russell's brilliant "Mining the Social Web, 2nd Edition" (O'Reilly, 2013).


In [6]:
import sys
import time
from urllib2 import URLError
from httplib import BadStatusLine

def make_twitter_request(t_func, max_errors=10, *args, **kw): 
    
    # A nested helper function that handles common HTTPErrors. Returns an updated
    # value for wait_period if the problem is a 500-level error. Blocks until the
    # rate limit is reset if it's a rate-limiting issue (429 error). Returns None
    # for 401 and 404 errors, which require special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):
    
        if wait_period > 3600: # Seconds
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e
    
        # See https://dev.twitter.com/docs/error-codes-responses for common codes
    
        if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        elif e.e.code == 429: 
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying in 15 minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function
    
    wait_period = 2 
    error_count = 0 

    while True:
        try:
            return t_func(*args, **kw)
        except TwitterHTTPError, e: # Imported earlier via "from twitter import *"
            error_count = 0 # The HTTP error was handled, so reset the consecutive-error count
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError, e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print >> sys.stderr, 'URLError encountered. Continuing.'
            if error_count > max_errors:
                print >> sys.stderr, 'Too many consecutive errors...bailing out.'
                raise
        except BadStatusLine, e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print >> sys.stderr, 'BadStatusLine encountered. Continuing.'
            if error_count > max_errors:
                print >> sys.stderr, 'Too many consecutive errors...bailing out.'
                raise
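
As a usage sketch, the wrapper works for any Python Twitter Tools call, not just timelines. For example, a one-off profile lookup (users.show maps to GET users/show; the screen name is just an example from the list above):

profile = make_twitter_request(t.users.show, screen_name='UNICEFIndia')
if profile is not None: # None means the wrapper swallowed a 401/404
    print('@%s has %i followers' % (profile['screen_name'], profile['followers_count']))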

Function to Retrieve User Tweets

This is from Matthew A. Russell's brilliant "Mining the Social Web, 2nd Edition" (O'Reilly, 2013).


In [16]:
def get_user_tweets(t, screen_name=None, user_id=None, max_results=1000):
     
    assert (screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"    
    
    kw = {  # Keyword args for the Twitter API call
        'count': 200,
        'trim_user': 'false',
        'include_rts' : 'true',
        'since_id' : 1
        }
    
    if screen_name:
        kw['screen_name'] = screen_name
    else:
        kw['user_id'] = user_id
        
    max_pages = 16
    results = []
    
    tweets = make_twitter_request(t.statuses.user_timeline, **kw)
    
    if tweets is None: # 401 (Not Authorized) - Need to bail out on loop entry
        tweets = []
        
    results += tweets
    
    print('Fetched %i tweets from @%s...' % (len(tweets), screen_name or user_id))
    
    page_num = 1
    
    # Many Twitter accounts have fewer than 200 tweets so you don't want to enter
    # the loop and waste a precious request if max_results = 200.
    
    # Note: Analogous optimizations could be applied inside the loop to try and 
    # save requests. e.g. Don't make a third request if you have 287 tweets out of 
    # a possible 400 tweets after your second request. Twitter does do some 
    # post-filtering on censored and deleted tweets out of batches of 'count', though,
    # so you can't strictly check for the number of results being 200. You might get
    # back 198, for example, and still have many more tweets to go. If you have the
    # total number of tweets for an account (by GET /users/lookup/), then you could 
    # simply use this value as a guide.
    
    if max_results == kw['count']:
        page_num = max_pages # Prevent loop entry
    
    while page_num < max_pages and len(tweets) > 0 and len(results) < max_results:
    
        # Necessary for traversing the timeline in Twitter's v1.1 API:
        # get the next query's max-id parameter to pass in.
        # See https://dev.twitter.com/docs/working-with-timelines.
        kw['max_id'] = min([ tweet['id'] for tweet in tweets]) - 1 
    
        tweets = make_twitter_request(t.statuses.user_timeline, **kw)

        if tweets is None: # A 401/404 hit mid-pagination; stop paging
            break

        results += tweets

        print('Fetched %i tweets from @%s...' % (len(tweets), screen_name or user_id))
    
        page_num += 1
        
    print('Done! We fetched %i tweets from @%s' % (len(results), screen_name or user_id))

    return results[:max_results]
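
Note the arithmetic behind max_pages: user_timeline returns at most 200 tweets per request, and Twitter caps the endpoint at roughly the 3,200 most recent tweets per account, so 16 pages of 200 is the most a single account can yield. A usage sketch with a numeric ID instead of a screen name (the user_id below is made up):

some_tweets = get_user_tweets(t, user_id=123456789, max_results=400)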

In [19]:
tweets = []
for username in usernames:
    try:
        data = get_user_tweets(t, screen_name=username, max_results=3200)
        for status in data:
            tweets.append(status)
    except Exception, e: # Don't let one failing account abort the whole run
        print >> sys.stderr, 'Skipping @%s: %s' % (username, e)


Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Fetched 200 tweets from @UNICEFIndia...
Done! We fetched 3200 tweets from @UNICEFIndia
Fetched 200 tweets from @satyamevjayate...
Fetched 198 tweets from @satyamevjayate...
Fetched 199 tweets from @satyamevjayate...
Fetched 195 tweets from @satyamevjayate...
Fetched 199 tweets from @satyamevjayate...
Fetched 198 tweets from @satyamevjayate...
Fetched 199 tweets from @satyamevjayate...
Fetched 200 tweets from @satyamevjayate...
Fetched 199 tweets from @satyamevjayate...
Fetched 200 tweets from @satyamevjayate...
Fetched 200 tweets from @satyamevjayate...
Fetched 200 tweets from @satyamevjayate...
Fetched 199 tweets from @satyamevjayate...
Fetched 194 tweets from @satyamevjayate...
Fetched 198 tweets from @satyamevjayate...
Fetched 196 tweets from @satyamevjayate...
Done! We fetched 3174 tweets from @satyamevjayate
Fetched 200 tweets from @aamir_khan...
Fetched 153 tweets from @aamir_khan...
Fetched 0 tweets from @aamir_khan...
Done! We fetched 353 tweets from @aamir_khan

In [20]:
len(tweets)


Out[20]:
6727
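This matches the per-account totals above: 3,200 + 3,174 + 353 = 6,727.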

Save Tweets

Save as JSON


In [21]:
with open(jsonpath, 'wb') as tweetsfile: # Get ready to write to output file
    json.dump(tweets, tweetsfile) # Write tweets to json file

Save as Pickle File


In [23]:
with open(picklepath, "wb") as tweetsfile:
    pickle.dump(tweets, tweetsfile) # Write tweets to pickle file
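
To sanity-check the files, you can read the tweets back. A minimal sketch, assuming the paths above were filled in:

with open(jsonpath, 'rb') as tweetsfile:
    tweets_from_json = json.load(tweetsfile)
with open(picklepath, 'rb') as tweetsfile:
    tweets_from_pickle = pickle.load(tweetsfile)
print('Read back %i tweets from JSON and %i from pickle'
      % (len(tweets_from_json), len(tweets_from_pickle))) # Both should equal len(tweets)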
